feat(validator): add support to validate essential metrics produced by Kepler #1834
…1 updates Bumps the go-dependencies group with 8 updates in the / directory: | Package | From | To | | --- | --- | --- | | [github.com/beevik/etree](https://github.com/beevik/etree) | `1.4.0` | `1.4.1` | | [github.com/cilium/ebpf](https://github.com/cilium/ebpf) | `0.15.0` | `0.16.0` | | [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) | `2.19.1` | `2.20.0` | | [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) | `1.19.1` | `1.20.0` | | [github.com/prometheus/prometheus](https://github.com/prometheus/prometheus) | `0.53.1` | `0.54.0` | | [golang.org/x/time](https://github.com/golang/time) | `0.5.0` | `0.6.0` | | [k8s.io/api](https://github.com/kubernetes/api) | `0.29.7` | `0.29.8` | | [k8s.io/client-go](https://github.com/kubernetes/client-go) | `0.29.7` | `0.29.8` | Updates `github.com/beevik/etree` from 1.4.0 to 1.4.1 - [Release notes](https://github.com/beevik/etree/releases) - [Changelog](https://github.com/beevik/etree/blob/main/RELEASE_NOTES.md) - [Commits](beevik/etree@v1.4.0...v1.4.1) Updates `github.com/cilium/ebpf` from 0.15.0 to 0.16.0 - [Release notes](https://github.com/cilium/ebpf/releases) - [Commits](cilium/ebpf@v0.15.0...v0.16.0) Updates `github.com/onsi/ginkgo/v2` from 2.19.1 to 2.20.0 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.19.1...v2.20.0) Updates `github.com/prometheus/client_golang` from 1.19.1 to 1.20.0 - [Release notes](https://github.com/prometheus/client_golang/releases) - [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md) - [Commits](prometheus/client_golang@v1.19.1...v1.20.0) Updates `github.com/prometheus/prometheus` from 0.53.1 to 0.54.0 - [Release notes](https://github.com/prometheus/prometheus/releases) - [Changelog](https://github.com/prometheus/prometheus/blob/main/CHANGELOG.md) - [Commits](prometheus/prometheus@v0.53.1...v0.54.0) Updates 
`golang.org/x/sys` from 0.22.0 to 0.23.0 - [Commits](golang/sys@v0.22.0...v0.23.0) Updates `golang.org/x/time` from 0.5.0 to 0.6.0 - [Commits](golang/time@v0.5.0...v0.6.0) Updates `k8s.io/api` from 0.29.7 to 0.29.8 - [Commits](kubernetes/api@v0.29.7...v0.29.8) Updates `k8s.io/apimachinery` from 0.29.7 to 0.29.8 - [Commits](kubernetes/apimachinery@v0.29.7...v0.29.8) Updates `k8s.io/client-go` from 0.29.7 to 0.29.8 - [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md) - [Commits](kubernetes/client-go@v0.29.7...v0.29.8) Updates `k8s.io/klog/v2` from 2.120.1 to 2.130.1 - [Release notes](https://github.com/kubernetes/klog/releases) - [Changelog](https://github.com/kubernetes/klog/blob/main/RELEASE.md) - [Commits](kubernetes/klog@v2.120.1...v2.130.1) --- updated-dependencies: - dependency-name: github.com/beevik/etree dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: github.com/cilium/ebpf dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/onsi/ginkgo/v2 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/prometheus/client_golang dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/prometheus/prometheus dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: golang.org/x/sys dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: golang.org/x/time dependency-type: indirect update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: k8s.io/api dependency-type: direct:production update-type: version-update:semver-patch 
dependency-group: go-dependencies - dependency-name: k8s.io/apimachinery dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: k8s.io/client-go dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: k8s.io/klog/v2 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/go_modules/go-dependencies-bb1f50d887 build(deps): bump the go-dependencies group across 1 directory with 11 updates
…updates Bumps the github-actions group with 5 updates in the / directory: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `3` | `4` | | [anchore/sbom-action](https://github.com/anchore/sbom-action) | `0.16.1` | `0.17.1` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4.3.4` | `4.3.6` | | [actions/setup-python](https://github.com/actions/setup-python) | `3` | `5` | | [ossf/scorecard-action](https://github.com/ossf/scorecard-action) | `2.3.3` | `2.4.0` | Updates `actions/checkout` from 3 to 4 - [Release notes](https://github.com/actions/checkout/releases) - [Commits](actions/checkout@v3...v4) Updates `anchore/sbom-action` from 0.16.1 to 0.17.1 - [Release notes](https://github.com/anchore/sbom-action/releases) - [Commits](anchore/sbom-action@v0.16.1...v0.17.1) Updates `actions/upload-artifact` from 4.3.4 to 4.3.6 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4.3.4...v4.3.6) Updates `actions/setup-python` from 3 to 5 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v3...v5) Updates `ossf/scorecard-action` from 2.3.3 to 2.4.0 - [Release notes](https://github.com/ossf/scorecard-action/releases) - [Changelog](https://github.com/ossf/scorecard-action/blob/main/RELEASE.md) - [Commits](ossf/scorecard-action@dc50aa9...62b2cac) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: anchore/sbom-action dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-patch dependency-group: github-actions - dependency-name: actions/setup-python dependency-type: direct:production update-type: 
version-update:semver-major dependency-group: github-actions - dependency-name: ossf/scorecard-action dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions ... Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/github_actions/github-actions-5a7b011f50 build(deps): bump the github-actions group across 1 directory with 5 updates
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
…server-patch-1 feat: add model_name attribute to ComponentModelWeights
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
Signed-off-by: Sunil Thaha <[email protected]>
…r-longer-test chore(validator): run stress test for longer
This commit introduces a workflow for testing ACPI functionality using Equinix self-hosted runners. The workflow deploys Kepler using mock-acpi compose setup and runs validator to ensure functionality. Key-features: - Workflow is triggered on pull requests that include a specific commit message `/test-acpi`. - Environment setup is handled by ansible. Signed-off-by: Vibhu Prashar <[email protected]>
…puting-io/add-acpi-wkf feat(ci): implement mock-ACPI workflow
…server-patch-1 fix: format ComponentModelWeights
This commit moves model_weights from /var/lib/kepler/data/ to its own directory, /var/lib/kepler/data/model_weights/. This allows additional data, such as machine-spec, to be stored in its own directory. Additionally, this change fixes the blank cpu.yaml that gets created when running compose files. Signed-off-by: Sunil Thaha <[email protected]>
…k-cpu-yaml chore: move model_weights to its own directory
Signed-off-by: Maryam Tahhan <[email protected]>
This commit resolves two key issues with the mock-acpi workflow: - Checkout correct branch: The workflow previously checked out the default branch when triggered by a pull request. This fix ensures that the correct pull request branch is checked out during the CI run. - Attach workflow to pull request checks: The workflow was not being reflected under pull request checks. With this fix, the workflow is correctly attached, so its status is visible and reported under pull request checks. Signed-off-by: Vibhu Prashar <[email protected]>
…eanup-exporter-globals chore: cleanup globals in exporter
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
…server-patch-1 fix: set default trainer only for local regressor
…pi-wk-status fix(ci): ensure proper status reporting for mock-acpi workflow
This commit addresses an issue where the `cleanup` and `final status` jobs were incorrectly dependent on the `create-runner` job, leading to their premature execution. This fix ensures that both the `cleanup` and `final status` jobs run independently and only after all necessary preceding jobs have finished. Signed-off-by: Vibhu Prashar <[email protected]>
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
…server-patch-1 feat: add --disable-power-meter option
…x-job-flow fix(ci): ensure independent execution of cleanup and status jobs
…idation feat: Export validation result as json object.
Signed-off-by: Vimal Kumar <[email protected]>
Signed-off-by: Vimal Kumar <[email protected]>
Signed-off-by: Maryam Tahhan <[email protected]>
This commit addresses the issue of multiple jobs defined in the `mock-acpi` workflow, which were unintentionally executing in parallel due to sequence constraints. By consolidating the workflow into a single job, we ensure that the tasks are executed sequentially Signed-off-by: Vibhu Prashar <[email protected]>
…puting-io/fix-flow fix(ci): consolidate mock-acpi workflow into single job
…e-computing-io#1876) Signed-off-by: Huamin Chen <[email protected]>
This commit migrates the mock-acpi workflow to use the GitHub runner instead of the Equinix self-hosted runner. Since the workflow is designed for testing ACPI functionality, using a mock, a self-hosted runner is not required. Running the workflow on the GitHub runner, which operates as a VM, enables execution on every pull request, ensuring consistent validation of ACPI functionality for Kepler. Signed-off-by: vprashar2929 <[email protected]>
…pi-wkf chore(ci): migrate mock-acpi workflow to GH runner
* [test]: add test case on package cgroup Signed-off-by: Sam Yuan <[email protected]> * [fix]: update cache setting logic when error happen Signed-off-by: Sam Yuan <[email protected]> * [fix]: fix lint Signed-off-by: Sam Yuan <[email protected]> --------- Signed-off-by: Sam Yuan <[email protected]>
…om-scaph feat(compose): add fallback scrape protocol for Scaphandre service
Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Sam Yuan <[email protected]>
Signed-off-by: Mario Vazquez <[email protected]>
Fixes paths for grace arm
Signed-off-by: Huamin Chen <[email protected]>
…race feat(sensor): support NVIDIA Grace Hopper
…e-computing-io#1889) Bumps the github-actions group with 4 updates: [actions/checkout](https://github.com/actions/checkout), [actions/attest-build-provenance](https://github.com/actions/attest-build-provenance), [actions/attest-sbom](https://github.com/actions/attest-sbom) and [codecov/codecov-action](https://github.com/codecov/codecov-action). Updates `actions/checkout` from 3 to 4 - [Release notes](https://github.com/actions/checkout/releases) - [Commits](actions/checkout@v3...v4) Updates `actions/attest-build-provenance` from 1 to 2 - [Release notes](https://github.com/actions/attest-build-provenance/releases) - [Changelog](https://github.com/actions/attest-build-provenance/blob/main/RELEASE.md) - [Commits](actions/attest-build-provenance@v1...v2) Updates `actions/attest-sbom` from 1 to 2 - [Release notes](https://github.com/actions/attest-sbom/releases) - [Changelog](https://github.com/actions/attest-sbom/blob/main/RELEASE.md) - [Commits](actions/attest-sbom@v1...v2) Updates `codecov/codecov-action` from 5.0.7 to 5.1.1 - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](codecov/codecov-action@v5.0.7...v5.1.1) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: actions/attest-build-provenance dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: actions/attest-sbom dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: codecov/codecov-action dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ble-computing-io#1892) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.26.0 to 0.31.0. - [Commits](golang/crypto@v0.26.0...v0.31.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
5fa8028
to
3c994bb
Compare
metal: metal # Job name for metal metrics, default is metal

url: http://localhost:9090 # Prometheus server URL
rate_interval: 60s # Rate interval for Promql, default is 20s, typically 4 x $scrape_interval
Explicitly using a rate interval of 60s because:
Prometheus scrape interval = 3s
Data points for a 12s interval (i.e. 4 * scrape interval) = 12/3 = 4 data points
Data points for a 60s interval = 60/3 = 20 data points
With 20 data points, we get a smoother and more reliable estimate. When comparing two sum(rate(...)) expressions,
a stable rate reduces the variability in the MAE calculations, leading to more accurate assessments.
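The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch: the helper name `samples_in_window` is made up for illustration and is not part of the validator, and it simply counts how many scrapes fit into one rate window.

```python
def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Approximate number of scraped samples inside one rate window."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    return window_s // scrape_interval_s


SCRAPE_INTERVAL_S = 3  # scrape interval quoted in the comment

# Compare the 4 x scrape_interval window (12s) with the chosen 60s window.
for window_s in (4 * SCRAPE_INTERVAL_S, 60):
    count = samples_in_window(window_s, SCRAPE_INTERVAL_S)
    print(f"rate window {window_s}s: about {count} samples")
```

The count is approximate because the exact number of samples in a window depends on how the scrape timestamps align with the window boundaries.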
28889fe
to
fe32d16
Compare
@@ -1,5 +1,5 @@
global:
scrape_interval: 5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
scrape_interval: 3s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
Check why the scrape interval was changed, and update the comment accordingly.
Scraping every 3 seconds rather than every 5 seconds collects significantly more data points over a typical time window (5/3, or roughly 1.67 times as many samples).
fe32d16
to
990c844
Compare
Here is a sample CI run for reference, showing what this would look like once we have this merged: https://github.com/sustainable-computing-io/kepler-metal-ci/actions/runs/12366281744/job/34512777104 My idea is to use the Equinix runners on demand on PRs. Reviewers or authors can add a comment in the PR, something like |
…y Kepler This commit introduces functionality to validate essential metrics produced by Kepler The following comparisons are included: - Node Exporter Comparison - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}` - Kepler Process Comparison - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to `kepler_process_<package|core|dram|platform|other|uncore>{dev}` - Kepler Node Comparison - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against `kepler_node_<package|core|dram|platform|other|uncore>{dev}` Additionally, the following changes are made to existing functionality: - Adds a new `metric_validations.yaml` file which includes promql queries for comparisons along with threshold values - Update the existing `stressor.sh` script to now support few more parameters to make it more flexible - warmup time: time to wait before starting the stressor - cooldown time: time to wait after the stressor is finished - repeats: number of times to repeat the stressor. Since for regression test we don't want to repeat the stressor multiple times - Adds a new `validator-regression.yaml` file which includes the configuration for the regression test Signed-off-by: vprashar2929 <[email protected]>
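The three stressor parameters named in this commit message (warmup, cooldown, repeats) can be pictured as a simple timeline. The sketch below is not the actual `stressor.sh` logic: the function name, the `load_s` parameter, and the assumption that the cooldown happens once after all repeats are hypothetical and only illustrate how the phases might line up.

```python
from typing import List, Tuple

Phase = Tuple[str, float, float]  # (phase name, start second, end second)


def stressor_timeline(warmup_s: float, cooldown_s: float,
                      repeats: int, load_s: float) -> List[Phase]:
    """Build a hypothetical phase timeline for one stressor run."""
    timeline: List[Phase] = []
    now = 0.0
    if warmup_s > 0:
        # Wait before starting the stressor.
        timeline.append(("warmup", now, now + warmup_s))
        now += warmup_s
    for i in range(repeats):
        # Each repeat applies the load for load_s seconds (assumed parameter).
        timeline.append((f"load-{i + 1}", now, now + load_s))
        now += load_s
    if cooldown_s > 0:
        # Wait after the stressor has finished (assumed to happen once).
        timeline.append(("cooldown", now, now + cooldown_s))
    return timeline


# A regression-style run: a single repeat with short warmup and cooldown.
for phase in stressor_timeline(warmup_s=5, cooldown_s=5, repeats=1, load_s=30):
    print(phase)
```

With `repeats=1`, the timeline has a single load phase, which matches the stated intent of not repeating the stressor during regression tests.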
990c844
to
b06242b
Compare
@vprashar2929 can we bring this into |
This commit introduces functionality to validate essential metrics produced by Kepler.
The following comparisons are included:
- Node Exporter Comparison
  - Validates `node_rapl_<package|core|dram>` metrics against `kepler_node_<package|core|dram>{dev}`
- Kepler Process Comparison
  - Compares `kepler_process_<package|core|dram|platform|other|uncore>{latest}` metrics to `kepler_process_<package|core|dram|platform|other|uncore>{dev}`
- Kepler Node Comparison
  - Validates `kepler_node_<package|core|dram|platform|other|uncore>{latest}` against `kepler_node_<package|core|dram|platform|other|uncore>{dev}`
Additionally, the following changes are made to existing functionality:
- Adds a new `metric_validations.yaml` file, which includes PromQL queries for comparisons along with threshold values
- Updates the existing `stressor.sh` script to support a few more parameters to make it more flexible
  - warmup time: time to wait before starting the stressor
  - cooldown time: time to wait after the stressor is finished
  - repeats: number of times to repeat the stressor, since for the regression test we don't want to repeat the stressor multiple times
- Adds a new `validator-regression.yaml` file, which includes the configuration for the regression test
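As a rough illustration of how a comparison with a threshold might work, the sketch below computes the mean absolute error (MAE) between two series and checks it against a limit. It assumes the comparison metric is MAE, as mentioned in the review discussion of the rate interval; the function names, sample values, and threshold are invented for illustration and do not come from `metric_validations.yaml`.

```python
from typing import Sequence


def mean_absolute_error(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Average of the absolute differences between two equal-length series."""
    if len(expected) != len(actual) or len(expected) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)


def within_threshold(expected: Sequence[float], actual: Sequence[float],
                     max_mae: float) -> bool:
    """Return True when the MAE does not exceed the configured threshold."""
    return mean_absolute_error(expected, actual) <= max_mae


# Made-up power readings standing in for the {latest} and {dev} query results.
latest = [10.0, 12.0, 11.0, 13.0]
dev = [10.5, 11.5, 11.0, 14.0]

mae = mean_absolute_error(latest, dev)
print(f"MAE = {mae:.2f}")
print("pass" if within_threshold(latest, dev, max_mae=1.0) else "fail")
```

In a real run the two series would come from PromQL query results over the same time range, so the samples would need to be aligned by timestamp before an error like this is computed.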